Today’s focus

What data to use in introductory statistics and data science courses? Ideally data that’s

  1. Rich enough to answer meaningful questions with
  2. Real enough to ensure that there is context
  3. Realistic enough to convey the reality of much of the world’s data

One goal

On the one hand, Cobb (2015) argues that we should

  1. “Teach through research”
  2. “Minimize prerequisites to research”

Another goal

On the other hand, from the New York Times:

Drawing

Analogy for second goal

Two conflicting goals

  • On the one hand: Minimize prerequisites to research
  • On the other: Do not betray the reality of data as it exists outside the classroom

Back to analogy

In other words, a balancing act is required between

  • Data with no prerequisites needed
  • Data as it exists “in the wild”

Data “taming”

Data “taming” sets out to

  • On the one hand: Perform enough pre-processing so that data is accessible to R novices
  • On the other: Not perform so much pre-processing as to betray the reality of data as it exists “in the wild”

“Tame” data principles

We propose the following “tame” data principles to remove the biggest hurdles R novices face:

  1. Clean variable names
  2. ID variables in left-hand columns
  3. Clean dates (More generally: clean numerical representations)
  4. Clean categorical variables
  5. Consistent “tidy” format

fivethirtyeight package

The fivethirtyeight R package:

  • Takes FiveThirtyEight’s raw article data from GitHub
  • Includes pre-processed versions of that data that follow the “tame” data principles
  • Makes data, documentation, and original article easily accessible via an R package

Examples are in R, so I suggest you follow along in the HTML version of this talk, available at bit.ly/causeweb_tame

Principle 1: Clean variable names

a) Comparing raw and tamed data

library(readr)
library(fivethirtyeight)

# Raw data: variable names are unwieldy & have spaces
flying_raw <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/flying-etiquette-survey/flying-etiquette.csv")
colnames(flying_raw)[c(5, 19)]
## [1] "Do you have any children under 18?"               
## [2] "In general, is itrude to bring a baby on a plane?"
# Tamed data: corresponding variable names are cleaner
colnames(flying)[c(5, 18)]
## [1] "children_under_18" "baby"

b) Why should we care?

Working with variable names that are long, unwieldy, and contain spaces is tricky.

mosaicplot(~ `Do you have any children under 18?` + `In general, is itrude to bring a baby on a plane?`, 
           data = flying_raw,  main = "Raw data",
           xlab = "Have a baby?", ylab = "Is it rude?")
mosaicplot(~ children_under_18 + baby,
           data = flying, main = "Tamed data",
           xlab = "Have a baby?", ylab = "Is it rude?")
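
A renaming step of this kind might look like the following minimal sketch. The data frame `survey_raw` is a hypothetical two-column stand-in for the raw CSV, and the lookup vector `new_names` is an assumption made for illustration; this is not the package's actual wrangling code.

```r
# Hypothetical stand-in for the raw survey data (not the real CSV).
# check.names = FALSE keeps the spaces and punctuation in the column names.
survey_raw <- data.frame(
  `Do you have any children under 18?` = c("Yes", "No", "No"),
  `In general, is itrude to bring a baby on a plane?` = c("No", "Yes", "No"),
  check.names = FALSE
)

# Lookup from raw name to short snake_case name (assumed mapping)
new_names <- c(
  "Do you have any children under 18?" = "children_under_18",
  "In general, is itrude to bring a baby on a plane?" = "baby"
)

# Rename by name rather than by position, so column order does not matter
survey_tame <- survey_raw
names(survey_tame) <- unname(new_names[names(survey_raw)])

# Tamed names need no backticks with `$` or in formulas
table(survey_tame$children_under_18, survey_tame$baby)
```

Renaming through a named lookup keeps the raw-to-tame mapping visible in one place, which makes it easier to document in a help file.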

Principle 2: ID variables

This principle is mostly organizational. Identification variables that uniquely identify the observations/rows should be placed in the left-hand columns, since they are the most prominent. Such variables serve as keys when joining/merging datasets.

library(dplyr)
library(fivethirtyeight)

# Both title and imdb site tag uniquely identify movies
biopics %>% 
  sample_n(3)
title                    site       country  year_release  box_office  director          number_of_subjects  subject          type_of_subject  race_known  subject_race      person_of_color  subject_sex  lead_actor_actress
Schindler’s List         tt0108052  US       1993          96100000    Steven Spielberg  1                   Oskar Schindler  Other            Known       White             FALSE            Male         Liam Neeson
The Diary of Anne Frank  tt0052738  US       1959          NA          George Stevens    1                   Anne Frank       Other            Known       White             FALSE            Female       Millie Perkins
Lady Sings the Blues     tt0068828  US       1972          9600000     Sidney J. Furie   1                   Billie Holiday   Musician         Known       African American  TRUE             Female       Diana Ross
# episode uniquely identifies episodes of "The Joy of Painting"
bob_ross %>% 
  sample_n(3)
episode season episode_num title apple_frame aurora_borealis barn beach boat bridge building bushes cabin cactus circle_frame cirrus cliff clouds conifer cumulus deciduous diane_andre dock double_oval_frame farm fence fire florida_frame flowers fog framed grass guest half_circle_frame half_oval_frame hills lake lakes lighthouse mill moon mountain mountains night ocean oval_frame palm_trees path person portrait rectangle_3d_frame rectangular_frame river rocks seashell_frame snow snowy_mountain split_frame steve_ross structure sun tomb_frame tree trees triple_frame waterfall waves windmill window_frame winter wood_framed
S04E03 4 3 MAJESTIC MOUNTAINS 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
S15E06 15 6 WAVES OF WONDER 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
S24E07 24 7 BACK COUNTRY 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
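
As a rough sketch of why this matters, the snippet below moves ID variables to the left and then uses one of them as a join key. The `movies` table reuses two rows shown above; the `watchlist` table and its `watched` column are made up purely for illustration.

```r
library(dplyr)

# Toy table built from two rows shown above, with the ID variables placed last
movies <- data.frame(
  year_release = c(1993, 1972),
  director     = c("Steven Spielberg", "Sidney J. Furie"),
  title        = c("Schindler's List", "Lady Sings the Blues"),
  site         = c("tt0108052", "tt0068828")
)

# Hypothetical second table keyed by the same IMDb tag; `watched` is made up
watchlist <- data.frame(
  site    = c("tt0068828", "tt0108052"),
  watched = c(TRUE, FALSE)
)

# Move the ID variables to the left-hand columns
movies_tame <- movies %>%
  select(title, site, everything())

# Use the ID variable as the key of a join
movies_joined <- movies_tame %>%
  left_join(watchlist, by = "site")
movies_joined
```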

Principle 3: Dates

a) Comparing raw and tamed data

library(readr)
library(fivethirtyeight)

# Raw data: year, month, day are separate variables
US_births_1994_2003_raw <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv")
head(US_births_1994_2003_raw)
year  month  date_of_month  day_of_week  births
1994  1      1              6            8096
1994  1      2              7            7772
1994  1      3              1            10142
1994  1      4              2            11248
1994  1      5              3            11053
1994  1      6              4            11406
# Tamed data: variable date of type "date" included
head(US_births_1994_2003)
year  month  date_of_month  date        day_of_week  births
1994  1      1              1994-01-01  Sat          8096
1994  1      2              1994-01-02  Sun          7772
1994  1      3              1994-01-03  Mon          10142
1994  1      4              1994-01-04  Tues         11248
1994  1      5              1994-01-05  Wed          11053
1994  1      6              1994-01-06  Thurs        11406

b) Why should we care?

Without a variable of type date, making time series plots is difficult.

# Use filter command from dplyr package for data wrangling
US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)

# Plot time series via base R:
plot(x = US_births_1999$date, y = US_births_1999$births, type = "l", 
     xlab = "Date", ylab = "Number of births", main = "1999 US Births")
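
The taming step itself could be sketched in base R roughly as follows. The three rows are copied by hand from the raw output above. Encoding `day_of_week` as an ordered factor is an assumption about one plausible representation, not a claim about how the package stores it.

```r
# First three rows of the raw data shown above, typed in by hand
births_raw <- data.frame(
  year          = c(1994, 1994, 1994),
  month         = c(1, 1, 1),
  date_of_month = c(1, 2, 3),
  day_of_week   = c(6, 7, 1),
  births        = c(8096, 7772, 10142)
)

births_tame <- births_raw

# Combine the three numeric columns into a single Date column
births_tame$date <- as.Date(
  ISOdate(births_tame$year, births_tame$month, births_tame$date_of_month)
)

# Recode the day-of-week number (1 = Monday, as the rows above imply)
# as an ordered factor, so that days sort Mon through Sun
births_tame$day_of_week <- factor(
  births_tame$day_of_week,
  levels  = 1:7,
  labels  = c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"),
  ordered = TRUE
)

str(births_tame)
```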

Principle 4: Categorical variables

a) Comparing raw and tamed data

library(readr)
library(ggplot2)
library(fivethirtyeight)
bechdel_raw <- read_csv("https://raw.githubusercontent.com/rudeboybert/fivethirtyeight/master/data-raw/bechdel/movies.csv")

# Raw data: categorical variable clean_test is saved as characters/strings
bechdel_raw$clean_test[1:5]
## [1] "notalk" "ok"     "notalk" "notalk" "men"
# Tamed data: clean_test is saved as factor
bechdel$clean_test[1:5]
## [1] notalk ok     notalk notalk men   
## Levels: nowomen < notalk < men < dubious < ok

b) Why should we care?

By default, character values are plotted in alphabetical order, whereas with factors we can set the order of the levels. Here, that lets us order the bars to follow the hierarchy of Bechdel test outcomes:

# Using raw data:
ggplot(bechdel_raw, aes(x = clean_test)) +
  geom_bar() +
  labs(x = "Bechdel test outcome", y = "count", title = "Raw data")

# Using tamed data:
ggplot(bechdel, aes(x = clean_test)) +
  geom_bar() +
  labs(x = "Bechdel test outcome", y = "count", title = "Tamed data")
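
The conversion from strings to an ordered factor can be sketched in base R as below. The five values and the level order are taken from the output above; the object names are hypothetical.

```r
# The five raw values shown above, as plain character strings
outcomes <- c("notalk", "ok", "notalk", "notalk", "men")

# Level order copied from the "Levels:" line of the tamed data
bechdel_levels <- c("nowomen", "notalk", "men", "dubious", "ok")

outcomes_fct <- factor(outcomes, levels = bechdel_levels, ordered = TRUE)

# Character vector: categories are tabulated alphabetically
table(outcomes)

# Factor: categories follow the level order, and unobserved levels
# (nowomen, dubious) still appear with a count of 0
table(outcomes_fct)
```

Because the levels are fixed up front, every plot or table of this variable shows the categories in the same order, including categories that happen to be empty in a given subset.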

Principle 5: “Tidy” data format

“Tidy” data format is narrow/long format, as opposed to wide. This format is chosen for input/output data frame standardization across many R packages in the tidyverse: ggplot2, dplyr, etc. There are three interrelated rules which make a dataset “tidy”:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.
Drawing

a) Comparing raw and tamed data

library(dplyr)
library(ggplot2)
library(fivethirtyeight)

# In fivethirtyeight package drinks data is kept in original non-tidy (wide) format
head(drinks)
country            beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
Afghanistan        0              0                0              0.0
Albania            89             132              54             4.9
Algeria            25             0                14             0.7
Andorra            245            138              312            12.4
Angola             217            57               45             5.9
Antigua & Barbuda  102            128              45             4.9
# tidyr::gather() code to convert to tidy format in help file: ?drinks
library(tidyr)
drinks_tidy <- drinks %>%
  gather(type, servings, -c(country, total_litres_of_pure_alcohol)) %>% 
  arrange(country)
head(drinks_tidy)
country      total_litres_of_pure_alcohol  type             servings
Afghanistan  0.0                           beer_servings    0
Afghanistan  0.0                           spirit_servings  0
Afghanistan  0.0                           wine_servings    0
Albania      4.9                           beer_servings    89
Albania      4.9                           spirit_servings  132
Albania      4.9                           wine_servings    54
ggplot(drinks_tidy, aes(x = type, y = servings)) + 
  geom_boxplot() +
  labs(x = "Alcohol type", y = "Number of servings", title = "Worldwide alcohol consumption")
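
In current versions of tidyr, `gather()` is superseded by `pivot_longer()`. The sketch below shows the equivalent reshaping on a two-row toy copy of the data (values taken from the output above), assuming tidyr 1.0.0 or later is installed. The name `drinks_toy` is hypothetical.

```r
library(tidyr)

# Two rows copied by hand from the wide output above
drinks_toy <- data.frame(
  country                      = c("Albania", "Algeria"),
  beer_servings                = c(89, 25),
  spirit_servings              = c(132, 0),
  wine_servings                = c(54, 14),
  total_litres_of_pure_alcohol = c(4.9, 0.7)
)

# Wide to long: one row per country and alcohol type
drinks_toy_tidy <- pivot_longer(
  drinks_toy,
  cols      = c(beer_servings, spirit_servings, wine_servings),
  names_to  = "type",
  values_to = "servings"
)
drinks_toy_tidy
```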

Advanced example

a) Comparing raw and tamed data

In the tamed pres_2016_trail data frame we:

  1. Ensured lat and lng were in numerical format, not in degree/minute/second, North/South, and East/West format (A variation on Principle 3: Dates)
  2. Combined both CSVs into one and added a candidate variable (Principle 5: Tidy data format)
library(dplyr)
library(fivethirtyeight)

# Tamed data: 
pres_2016_trail %>% 
  arrange(date) %>% 
  head()
candidate  date        location            lat       lng
Trump      2016-09-01  Wilmington, OH      39.44534  -83.82854
Trump      2016-09-03  Detroit, MI         42.33143  -83.04575
Clinton    2016-09-05  Cleveland, Ohio     41.49932  -81.69436
Clinton    2016-09-05  Hampton, Illinois   41.55587  -90.40930
Clinton    2016-09-06  Tampa, Florida      27.95058  -82.45718
Trump      2016-09-06  Virginia Beach, VA  36.85293  -75.97799
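
The two taming steps listed above can be sketched in base R as follows. The helper `dms_to_decimal()` is hypothetical and only illustrates the arithmetic; the per-candidate tables are toy one-row stand-ins for the two raw CSVs, not the package's actual processing code.

```r
# Hypothetical helper: degrees/minutes/seconds plus hemisphere letter
# to signed decimal degrees (South and West become negative)
dms_to_decimal <- function(degrees, minutes, seconds, hemisphere) {
  sign <- ifelse(hemisphere %in% c("S", "W"), -1, 1)
  sign * (degrees + minutes / 60 + seconds / 3600)
}

dms_to_decimal(39, 30, 0, "N")   # 39.5
dms_to_decimal(83, 45, 0, "W")   # -83.75

# Toy one-row stand-ins for the two per-candidate CSVs
clinton <- data.frame(date = as.Date("2016-09-05"), location = "Cleveland, Ohio")
trump   <- data.frame(date = as.Date("2016-09-01"), location = "Wilmington, OH")

# Record the candidate as a variable, then stack into one data frame
clinton$candidate <- "Clinton"
trump$candidate   <- "Trump"
trail <- rbind(clinton, trump)

# Put the identifying variable in the left-hand column (Principle 2)
trail <- trail[, c("candidate", "date", "location")]
trail
```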

b) Why should we care?

So we can easily create a faceted map!

library(ggplot2)
library(maps)
ggplot(data = pres_2016_trail, aes(x = lng, y = lat)) +
  facet_wrap(~candidate) +
  geom_point(col = "black", size = 2) + 
  coord_map() + 
  # Override data & aes()thetic mapping set above to trace path of state outlines:
  geom_path(data = map_data("state"), aes(x = long, y = lat, group = group), size = 0.1)

Comments

  • Analogy I like: fivethirtyeight is like a data petting zoo
  • No “universal” balance of two goals: it will vary depending on your students’ experience, requirements, and needs.
  • fivethirtyeight package used in other contexts:
    1. Intermediate-level data science courses
    2. Advanced R package development project

Used in data science courses

  1. Recruited STAT231 Data Science students to “tame” datasets that STAT135 Intro students found for their final projects
  2. Available on GitHub: the data wrangling source code the package authors used to convert the raw 538 CSV data to “tamed” format: process_data_sets_albert.R, process_data_sets_chester.R, process_data_sets_jen.R

Used for advanced projects

Other resources